Learning similarity-based word sense disambiguation 

from sparse data 



G^ 



(N 

> 



Ph. 



X 



Yael Karov and Shimon Edelman 
Dept. of Applied Mathematics and Computer Science 
The Weizmann Institute of Science 

Rehovot 76100, Israel 
[yaelk, edelman] (9wisdom.weizmann.ac.il 

February 5, 2008 



Abstract 



We describe a method for automatic word sense disambiguation using a text corpus and 
a macliine-readable dictionary (MRD). The method is based on word similarity and context 
similarity measures. Words are considered similar if they appear in similar contexts; contexts are 
similar if they contain similar words. The circularity of this definition is resolved by an iterative, 
converging process, in which the system learns from the corpus a set of typical usages for each 
Q>^ ' of the senses of the polysemous word listed in the MRD. A new instance of a polysemous word 

. is assigned the sense associated with the typical usage most similar to its context. Experiments 

' show that this method performs well, and can learn even from very sparse training data. 

o 

^ : Introduction 

Word Sense Disambiguation (WSD) is the problem of assigning a sense to an ambiguous word, 
using its context. We assume that different senses of a word correspond to different entries in its 
dictionary definition. For example, suit has two senses listed in a dictionary: an action in court, 
Q , and suit of clothes. Given the sentence The union's lawyers are reviewing the suit, we would like 

^ I the system to decide automatically that suit is used there in its court-related sense (we assume 

that the part of speech of the polysemous word is known). 

In recent years, text corpora have been the main source of information for learning automatic 



I WSD (see, e.g., ( Gale et al., 1992D ). A typical corpus-based algorithm constructs a training set 



from all contexts of a polysemous word W in the corpus, and uses it to learn a classifier that maps 
instances of W (each supplied with its context) into the senses. Because learning requires that the 
examples in the training set be partitioned into the different senses, and because sense information 
is not available in the corpus explicitly, this approach depends critically on manual sense tagging — 
a laborious and time-consuming process that has to be repeated for every word, in every language, 
and, more likely than not, for every topic of discourse or source of information. 

The need for tagged examples creates a problem referred to in previous works as the knowledge 
acquisition bottleneck: training a disambiguator for W requires that the examples in the corpus be 
partitioned into senses, which, in turn, requires a fully operational disambiguator. The method we 
propose circumvents this problem by automatically tagging the training set examples for W using 
other examples, that do not contain W, but do contain related words extracted from its dictionary 
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definition. For instance, in the training set for suit, we would use, in addition to the contexts of 
suit, all the contexts of court and of clothes in the corpus, because court and clothes appear 
in the MRD entry of suit that defines its two senses. Note that, unlike the contexts of suit, which 
may discuss either court action or clothing, the contexts of court are not likely to be especially 
related to clothing, and, similarly, those of clothes will normally have little to do with lawsuits. 
We will use this observation to tag the original contexts of suit. 

Another problem that affects the corpus-based WSD methods is the sparseness of data: these 
methods typically rely on the statistics of cooccurrences of words, while many of the possible 
cooccurrences are not observed even in a very large corpus ( [Church and Mercer, 199^ ) . We address 
this problem in several ways. First, instead of tallying word statistics for the examples of each 
sense (which may be unreliable when the examples are few), we collect sentence- level statistics, 
representing each sentence by the set of features it contains (more on features in section 3.2). 
Second, we define a similarity measure on the feature space, which allows us to pool the statistics 
of similar features. Third, in addition to the examples of the polysemous word W in the corpus, we 
learn also from the examples of all the words in the dictionary definition of W. In our experiments, 
this resulted in a training set that could be up to 20 times larger than the set of original examples. 

The rest of this paper is organized as follows. Section |l] describes the approach we have de- 
veloped. In section ^, we report the results of tests we have conducted on the Treebank-2 corpus. 
Section ^ concludes with a discussion of related methods and a summary. Proofs and other details 
of our scheme can be found in the appendix. 



1 Similarity-based disambiguation 

Our aim is to have the system learn to disambiguate the appearances of a polysemous word W with 
senses si, . . . ,Sfc, using the appearances of W in an untagged corpus as examples. To avoid the 
need to tag the training examples manually, we augment the training set by additional sense-related 
examples, which we call a feedback set. The feedback set for sense Sj of word W is the union of all 
contexts that contain some noun found in the entry of Sj(W) in a MRD0 (high-frequency nouns, 
and nouns in the intersection of any two sense entries, as well as examples in the intersection of two 
feedback sets, are discarded). The feedback sets can be augmented, in turn, by original training-set 
sentences that are closely related (in a sense defined below) to one of the feedback set sentences; 
these additional examples can then attract other original examples. 

The feedback sets constitute a rich source of data that are known to be sorted by sense. Specif- 
ically, the feedback set of Sj is known to be more closely related to Sj than to the other senses of the 
same word. We rely on this observation to tag automatically the examples of W, as follows. Each 
original sentence containing W is assigned the sense of its most similar sentence in the feedback 
sets. Two sentences are considered to be similar insofar as they contain similar words (they do not 
have to share any word); words are considered to be similar if they appear in similar sentences. 
The circularity of this definition is resolved by an iterative, converging process, described below. 

1.1 Terminology 

A context, or example of the target word W is any sentence that contains W, and (optionally) 
the two adjacent sentences in the corpus. The features of a sentence are its nouns, verbs, and 
the adjectives of W and of the nouns from Ws MRD definition, all used after stemming (it is 

^By MRD we mean a machine-readable dictionary or a thesaurus, or any combination of such knowledge sources. 
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also possible to use other types of features, such as word n-grams or syntactic information; see 



section 3^). As the number of features in the training data can be very large, we automatically 



assign each relevant feature a weight indicating the extent to which it is indicative of the sense 



(see section [A.Sl) . Features that appear less than two times, and features whose weight falls under 
a certain threshold are excluded. A sentence is represented by the set of the remaining relevant 
features it contains. 

1.2 Computation of similarity 

Our method hinges on the possibility to compute similarity between the original contexts of W and 
the sentences in the feedback sets. We concentrate on similarities in the way sentences use W, and 
not in their meaning. Thus, similar words tend to appear in similar contexts, and their textual 
proximity to the ambiguous word W is indicative of the sense of W. Note that contextually similar 
words do not have to be synonyms, or to belong to the same lexical category. For example, we 
consider the words doctor and health to be similar because they frequently share contexts, although 
they are far removed from each other in a typical semantic hierarchy such as the WordNet ( [Miller 



let al., 1993| ). Note, further, that b ecause we learn similarity from the training set of W, and not 
from the entire corpus, it tends to capture regularities with respect to the usage of W, rather than 
abstract or general regularities. For example, the otherwise unrelated words war and trafficking are 
similar in the contexts of the polysemous word drug {narcotic/ medicine) , because the expressions 
drug trafficking and the war on drugs appear in related contexts of drug. As a result, both war and 
trafficking are similar in being strongly indicative of the narcotic sense of drug. 

Words and sentences play complementary roles in our approach: a sentence is represented by 
the set of words it contains, and a word — by the set of sentences in which it appears. Sentences 
are similar to the extent they contain similar words; words are similar to the extent they appear 
in similar sentences. Although this definition is circular, it turns out to be of great use, if applied 
iteratively, as described below. 

In each iteration, we update a word similarity matrix M^^\ whose rows and columns are labeled 
by all the words encountered in the training set of W. In that matrix, the cell M^'^\i,j) holds a 
value between and 1, indicating the extent to which word i is contextually similar to word j. In 

(s) 

addition, we keep and update a separate sentence similarity matrix M- for each sense Sj of W 
(including a matrix Mq^^ that contains the similarities of the original examples to themselves) . The 

(s) 

rows in a sentence matrix M- ' correspond to the original examples of W, and the columns — to 
the original examples of W for i = 0, and to the feedback-set examples for sense Sj, for z > 0. 

To compute the similarities, we initialize the word similarity matrix to the identity matrix (each 
word is fully similar to itself, and completely dissimilar to other words), and iterate (see Figure |l]): 

1. update the sentence similarity matrices M^^\ using the word similarity matrix M^^^; 

2. update the word similarity matrix M'^^\ using the sentence similarity matrices M^^\ 



until the changes in the similarity values are small enough (see section A.l for a detailed description 



of the stopping conditions; a proof of convergence appears in the appendix). 
1.2.1 The affinity formula 

The algorithm for updating the similarity matrices involves an auxiliary relation between words 
and sentences, which we call affinity, introduced to simplify the symmetric iterative treatment of 



3 



Itiiualizcd 

Word 

Similarilv 




Figure 1: Iterative computation of word and sentence similarities. 

similarity between words and sentences. A word W is assumed to have a certain affinity to every 
sentence. Affinity (a real number between and 1) reflects the contextual relationships between W 
and the words of the sentence. If W belongs to a sentence S, its affinity to S is 1; if W is totally 
unrelated to S, the affinity is close to (this is the most common case); if W is contextually similar 
to the words of S, its affinity to S is between and 1. In a symmetric manner, a sentence S has 
some affinity to every word, reflecting the similarity of S to sentences involving that word. 

We say that a word belongs to a sentence, denoted as W £ S, if it textually contained there; in 
this case, sentence is said to include the word: S B W. Affinity is then defined as follows: 

aSn{W, S) = maxsim„(>V, Wi) (1) 
affn(5, W) = max sim„(<S, Sj) (2) 

where n denotes the iteration number .0 The initial representation of a sentence, as the set of words 
that it directly contains, is now augmented by a similarity-based representation; The sentence 
contains more information or features than the words directly contained in it. Every word has some 
affinity to the sentence, and the sentence can be represented by a vector indicating the affinity of 
each word to it. Similarly, every word can be represented by the affinity of every sentence to it. 
Note that affinity is asymmetric: aff (5, W) 7^ aff (W, S), because W may be similar to one of the 
words in S, which, however, is not one of the topic words of S; it is not an important word in s. In 
this case, aff(W, S) is high, because W is similar to a word in S, but aff(5, W) is low, because S 
is not a representative example of the usage of the word W. 

1.2.2 The similarity formula 

We define the similarity of Wi to W2 to be the average affinity of sentences that include Wi to 
those that include VV2 . The similarity of a sentence Si to another sentence ^2 is a weighted average 
of the affinity of the words in iSi to those in ^2: 

^At a first glance it may seem that the mean rather than the maximal similarity of W to the words of a sentence 
should determine the affinity between the two. However, any definition of affinity that takes into account more words 
than just the one with the maximal similarity to W, may result in a word being directly contained in the sentence, 
but having an affinity to it that is smaller than 1. 
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sim„+i(<Si, 52) = 5] weight(W, 5i) • aff„(W, 52) (3) 
sim„+i(>Vi, W2) = 5] weight(5, Wi) • aff„(5, W2) (4) 

where the weights sum to 1.^ 

1.2.3 The importance of iteration 

Initially, only identical words are considered similar, so that aff(W, 5) = 1 if W G 5; the affinity 
is zero otherwise. Thus, in the first iteration, the similarity between Si and ^2 depends on the 
number of words from Si that appear in ^2, divided by the length of ^2 (note that each word 
may carry a different weight). In the subsequent iterations, each word W € 5i contributes to the 
similarity of Si to ^2 a value between and 1, indicating its affinity to 52, instead of voting either 
(if W G 52) or 1 (if W 52). Analogously, sentences contribute values to word similarity. 

One may view the iterations as successively capturing parameterized "genealogical" relation- 
ships. Let words that share contexts be called direct relatives; then words that share neighbors 
(have similar cooccurrence patterns) are once-removed relatives. These two family relationships are 
captured by the first iteration, and also by most traditional similarity measures, which are based 
on cooccurrences. The second iteration then brings together twice-removed relatives. The third 
iteration captures higher similarity relationships, and so on. Note that the level of relationship here 
is a gradually consolidated real-valued quantity, and is dictated by the amount and the quality of 
the evidence gleaned from the corpus; it is not an all-or-none "relatedness" tag, as in genealogy. 

The following simple example demonstrates the difference between our similarity measure and 
pure cooccurrence-based similarity measures, which cannot capture higher-order relationships. 
Consider the set of three sentence fragments: 

si: eat banana 
s2: taste banana 
s3: eat apple 

In this "corpus," the similarity of taste and apple, according to the cooccurrence-based methods, 
is 0, because the contexts of these two words are disjoint. In comparison, our iterative algorithm 
will capture some similarity: 

• Initialization. Every word is similar to itself only. 

• First iteration. The sentences eat banana and eat apple have similarity of 0.5, because of the 
common word eat. Furthermore, the sentences eat banana and taste banana have similarity 0.5: 

— banana is learned to be similar to apple because of their common usage {eat banana and 
eat apple); 



^The weight of a word estimates its expected contribution to the disambiguation task, and is a product of several 
factors: the frequency of the word in the corpus, its frequency in the training set relative to that in the entire co rpus ; 



the textual distance from the target word, and its part of speech (more details on word weights appear in section A. 3). 
All the sentences that include a given word are assigned identical weights 
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— taste is similar to eat because of their common usage (taste banana and eat banana); 

— taste and apple are not similar (yet). 



• Second iteration. The sentence taste banana has now some similarity to eat apple, because 
in the previous iteration taste was similar to eat and banana was similar to apple. The word 
taste is now similar to apple because the taste sentence (taste banana) is similar to the apple 
sentence (eat apple). Yet, banana is more similar to apple than taste, because the similarity 
value of banana and apple further increases in the second iteration. 



This simple example demonstrates the transitivity of our similarity measure, which allows it to 
extract high-order contextual relationships. In more complex situations, the transitivity-dependent 
spread of similarity is slower, because each word is represented by many more sentences. Iteration 



stops when the changes in the similarity values are small enough (see section AJ). In practice, this 



happens after about three iterations, which, intuitively, suffice to exhaust the transitive exploration 
of similarities. After that, although the similarity values may continue to increase, their rank order 
does not change significantly. That is, if in the third iteration a sentence S was more similar to 71 
than to T2, this order will, by and large, prevail also in the subsequent iterations, even though the 
similarity values may still increase. 

The most important properties of the similarity computation algorithm are convergence (see 
appendix |A.2| ), and utility in supporting disambiguation (described in section three other prop- 
erties are as follows. First, word similarity computed according to the above algorithm is asymmet- 
ric. For example, drug is more similar to traffic than traffic is to drug, because traffic is mentioned 
more frequently in drug contexts than drug is mentioned in contexts of traffic (which has many 
other usages). Likewise, sentence similarity is asymmetric: if Si is fully contained in ^2, then 
sim(5i, 52) = 1, whereas sim(52. Si) < 1. Second, words with a small count in the training 
set will have unreliable similarity values. These, however, are multiplied by a very low weight 
when used in sentence similarity evaluation, because the frequency in the training set is taken into 
account in computing the word weights. Third, in the computation of sim(Wi, W2) for a very 
frequent VV2, the set of its sentences is very large, potentially inflating the affinity of Wi to the 
sentences that contain W2. We counter this tendency by multiplying sim(>Vi, W2) by a weight 
that is reciprocally related to the global frequency of VV2 . 



1.3 Using similarity to tag the training set 

Following convergence, each sentence in the training set is assigned the sense of its most similar 
sentence in one of the feedback sets of sense Sj, using the final sentence similarity matrix. Note 
that some sentences in the training set belong also to one of the feedback sets, because they contain 
words from the MRD definition of the target word. Those sentences are automatically assigned the 
sense of the feedback set to which they belong, since they are most similar to themselves. Note 
also that an original training-set sentence S can be attracted to a sentence F from a feedback set, 
even if S and F do not share any word, because of the transitivity of the similarity measure. 



1.4 Learning the typical uses of each sense 

We partition the examples of each sense into typical use sets, by grouping all the sentences that 
were attracted to the same feedback-set sentence. That sentence, and all the original sentences 
attracted to it, form a class of examples for a typical usage. Feedback-set examples that did not 
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attract any original sentences are discarded. If the number of resulting classes is too high, further 
clustering can be carried out on the basis of the distance metric defined by 1 — sim(x, y), where 
sim(x, y) are values taken from the final sentence similarity matrix. 

A typical usage of a sense is represented by the affinity information generalized from its exam- 
ples. For each word W, and each cluster C of examples of the same usage, we define: 



aff(W, C) = maxaff(>V, 5) (5) 
= max max sim(W, W,) (6) 

For each cluster we construct its affinity vector, whose z'th component indicates the affinity of word i 
to the cluster. It suffices to generalize the affinity information (rather than similarity), because 
new examples are judged on the basis of their similarity to each cluster: in the computation of 
sim(«Si, S2) (equation 1^, the only information concerning 1S2 is its affinity values. 

1.5 Testing new examples 

Given a new sentence S containing a target word W, we determine its sense by computing the 
similarity of S to each of the previously obtained clusters Ck, and returning the sense of the most 
similar cluster: 



sim(5„e«„ Ck) = weight(>V, Snew) ■ aff(>V, Ck) (7) 

sim(5„e^. Si) = maxsim(5„e^, C) (8) 

2 Experimental evaluation of the method 

We tested the algorithm on the Treebank-2 corpus, which contains 1 million words from the Wall 
Street Journal, 1989, and is considered a small corpus for the present task. As the MRD, we used 
a combination of the Webster, the Oxford and the WordNet online dictionaries (the latter used as 
a thesaurus only; see section |3.3| ). During the development and the tuning of the algorithm, we 
used the method of pseudo- words ( pale et al., 199^ ; ^chutze, 1992 ), to save the need for manual 
verification of the resulting sense tags. 

The final algorithm was tested on a total of 500 examples of four polysemous words: drug, 
sentence, suit, and player (see Table The relatively small number of polysemous words we 
studied was dictated by the size and nature of the corpus (we are currently testing additional 
words, using texts from the British National Corpus). 

The average success rate of our algorithm was 92%. The original training set (before the addition 
of the feedback sets) consisted of a few dozen examples, in comparison to thousands of examples 
needed in other corpus-based methods ( ^chutze, 1992| ; Yarowsky, 19951) . 



Results on two of the words on which we tested our algorithm {drug and suit) have been also 
reported in the works of Schutze and Yarowsky. It is interesting to compare the performance of 
the different methods on these words. On the word drug, our algorithm achieved performance of 
90.5%, after being trained on 148 examples (contexts). In comparison, ( Yarowsky, 1995| ) achieved 



7 



Table 1: A summary of the experimental results on four polysemous words. 



Word 


Senses 


Sample 


Feedback 


% correct 


% correct 






Size 


Size 


per sense 


total 


drug 


narcotic 


65 


100 


92.3 


90.5 




medicine 


83 


65 


89.1 




sentence 


judgement 


23 


327 


100 


92.5 




grammar 


4 


42 


50 




suit 


court 


212 


1461 


98.59 


94.8 




garment 


21 


81 


55 




player 


performer 


48 


230 


87.5 


92.3 




participant 


44 


1552 


97.7 





91.4% correct performance, using 1380 contexts and the dictionary definitions in training]^ On the 
word suit, our method achieved performance of 94.8%, using 233 training contexts; in comparison, 
( Schutze, 1992| ) achieved 95% correct performance, using 8206 contexts. In summary, our algorithm 
achieved performance comparable to some of the best reported results, using much less data for 
training. This feature of our approach is important, because the size of the available training set 
is usually severely constrained for most senses of most words (Gale et al., 1992). Finally, we note 
that, as in most corpus-based methods, supplying additional examples is expected to improve the 
performance. 

We now present in detail several of the results obtained with the word drug. A plot of the 
improvement in the performance vs. iteration number appears in Figure ^. The success rate is 
plotted for each sense, and for the weighted average of both senses we considered (the weights are 
proportional to the number of examples of each sense). 

Figure ^ shows how the similarity values develop with iteration number. For each example S of 
the narcotic sense of drug, the value of sim„(5, narcotic) increases with n. Note that after several 
iterations the similarity values are close to 1, and, because they are bounded by 1, they cannot 
change significantly with further iterations. 

Figure ^ compares the similarities of a narcotic example to the narcotic sense and to the medicine 
sense, for each iteration. The medicine sense assignment, made in the first iteration, has been 
corrected in the following iterations. 

Table ^ shows the most similar words found for the words with the highest weights in the drug 
example (low-similarity words have been omitted). Note that the similarity is contextual, and is 
affected by the polysemous target word. For example, trafficking was found to be similar to crime, 
because in drug contexts the expressions drug trafficking and crime are highly related. In general, 
trafficking and crime need not be similar, of course. 



^Yarowsky subsequently improved that result to 93.9%, using his "one sense per discourse" constraint. We expect 
that a similar improvement could be achieved if that constraint were used in conjunction with our method. 
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Word 


Most contextually similar words 


The medicine sense: 


J • J.' 

medication 


antibiotic blood prescription medicine percentage pressure 


prescription 


analyst antibiotic blood campaign introduction law line-up medication medicine 




1 ' 1 f 1 1 T ' i 1 1i* 111!*" 1 

percentage print profit publicity quarter sedative state television tranquilizer use 


medicine 


prescription campaign competition dollar earnings law manufacturing 




margin print product publicity quarter result sale saving sedative 




staff state television tranquilizer unit use 


disease 


antibiotic blood linc-up medication medicine prescription 


symptom 


1 1 * " 1 * • £ J. ^ ^- 

nypoglycemia msuim warning manufacturer product 




plant animal death diabetic evidence finding metabolism study 


msulm 


hypoglycemia manufacturer product symptom warning 




death diabetic finding report study 


tranquilizer 


campaign law medicine prescription print publicity sedative 




television use analyst profit state 


dose 


appeal dc^atli impact injury liability manufactuK^r iniscarriagx^ refusing ruling 




(li(>tli\istill)('strol liormoiK" (laiiiage (effect lomalc i)ros})(x i slat(> 


The narcofAc sense: 


consumer 


distributor effort cessation consumption country reduction requirement 




victory battle capacity cartel government mafia newspaper people 


mafia 


terrorism censorship dictatorship newspaper press brother nothing aspiration 




assassination editor leader politics rise action country doubt freedom 




mafioso medium menace solidarity structure trade world 


terrorism 


censorship doubt freedom mafia medium menace newspaper 




press solidarity structure 


murder 


capital-punishment symbolism trafficking furor killing substance crime 




restaurant law bill case problem 


menace 


terrorism freedom solidarity structure medium press censorship country doubt 




mafia newspaper way attack government magnitude people relation threat world 


trafficking 


crime capital-punishment furor killing murder restaurant substance symbolism 


dictatorship 


aspiration brother editor mafia nothing politics press 




assassination censorship leader newspaper rise terrorism 


assassination 


brother censorship dictatorship mafia nothing press terrorism 




aspiration editor leader newspaper politics rise 


laundering 


army lot money arsenal baron economy explosive government hand 




materiel military none opinion portion talk 


censorship 


mafia newspaper press terrorism country doubt freedom 




medium menace solidarity structure 



Table 2: The drug experiment; the nearest neighbors of the highest- weight words. The words in the entries 
are those with the highest weights, whose similarity values have, therefore, the greatest effect. Note that the 
similarity is contextual, and is highly dependent on the polyscmous target word. For example, trafficking 
was found to be similar to crime, because in the drug contexts the expressions drug trafficking and crime 
are highly related. In general, trafficking and crime need not be similar, of course. Also note that the 
similarity is affected by the training corpus. For example, in the Wall Street Journal, the word medicine is 
mentioned mostly in contexts of making profit, and in advertisements. Thus, in the medicine cluster there 
one finds words such as analyst, campaign, profit, quarter, dollar, which serve as hints for the medicine 
sense. Although profit and m,edicine are not closely related scmantically (relative to a more balanced corpus 
than WSJ), their contexts in the WSJ contain words that are similarly indicative of the sense of the target 
word. This kind of similarity, therefore, suits its purpose, which is sense disambiguation, although it may 
run counter to some of our intuitions regarding general semantic similarity. 



9 




1 23456789 10 



Figure 2: The drug experiment; the change in the disambiguation performance with iteration 
number is plotted separately for each sense. The asterisk marks the plot of the success rate for the 
narcotic sense. 



Similarity of S9nse1 examples to sensel feedback set 




Examples ° " iteration* 



Figure 3: The drug experiment; example runs, sorted by the second-iteration similarity values. 



3 Discussion 



We now discuss in some detail the choices made at the different stages of the development of the 
present method, and its relationship to some of the previous works on word sense disambiguation. 



3.1 Flexible sense distinctions 



The possibility of strict definition of each sense of a polysemous word, and the possibility of un- 
ambiguous assignment of a given sense in a given situation are, in themselves, nontrivial issues 
in philosophy ( Quine, 1960| ) and linguistics ( Weinreich, 1980 ; Cruse, 1986| ). Different dictionaries 
often disagree on the definitions; the split into senses may also depend on task at hand. Thus, it is 
important to maintain the possibility of flexible distinction of the different senses, e.g., by letting 
this distinction be determined by an external knowledge source such as a thesaurus or a dictionary. 
Although this requirement may seem trivial, most corpus-based methods do not, in fact, allow 



such flexibility. For example, defining the senses by the possible translations of the word ( Pagan 
and Itai, 1991 ; [Brown et al., 199"! ; pale et al., 1992] ), by the Roget's categories ( Yarowsky, 1992 ), 
or by clustering ( ^chutze, 1992)) yields a grouping that does not always conform to the desired 
sense distinctions. In comparison, our reliance on the MRD the definition of senses in the initial- 
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Figure 4: The drug experiment; the similarity between a narcotic-sense example to each of the two 
senses. The asterisk marks the plot for the narcotic sense. The sentence was The American people 
and their government also woke up too late to the menace drugs posed to the moral structure of 
their country. 

The word menace which is a hint for the narcotic sense in this sentence, did not help in the first 
iteration, because it did not appear in the narcotic feedback set at all. Thus, in iteration 1, the 
similarity of this sentence to the medicine sense was 0.15, vs. similarity of 0.1 to the narcotic sense. 
In iteration 2, menace was learned to be similar to other narcotic-related words, yielding a small 
advantage for the narcotic sense. In iteration 3, further similarity values were updated, and there 
was a clear advantage to the narcotic sense (0.93, vs. 0.89 for medicine). 



ization of the learning process guarantees the required flexibility in setting the sense distinctions. 
Pure MRD-based methods allow the same flexibility, but do not yield good results, because the 
definitions alone do not contain enough information for disambiguation. 



3.2 Sentence features 

Different polysemous words may benefit from different types of features of the context sentences. 
Polysemous words for which distinct senses tend to appear in different topics can be disambiguated 
using single words as the context features, as we did here. Disambiguation of other polysemous 
words may require taking the sentence structure into account, using n- grams or syntactic constructs 
as features. This additional information can be incorporated into our method, by (1) extracting 
features such as nouns, verbs, adjectives of the target word, bi-grams, tri-grams, and subject-verb 
or verb-object pairs, (2) discarding features with a low weight (cf. section [A. 3D , and (3) using the 
remaining features instead of single words (i.e., by representing a sentence by the set of significant 
features it contains, and a feature — by the set of sentences in which it appears). 



3.3 Using WordNet 

The initialization of the word similarity matrix using WordNet (a hand-crafted semantic network 
arranged in a hierarchical structure; ( [Miller et al., 1993| )) may seem to be advantageous over simply 
setting it to the identity matrix, as we have done. To compare these two approaches, we tried to set 



the initial similarity between two words to the WordNet path length between their nodes (Lee et al. 



1993 ), and then learn the similarity values iteratively. This, however, led to worse performance 



than the simple identity-matrix initialization. 
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There are several possible reasons for the poor performance of WordNet in this comparison. 
First, WordNet is not designed to capture contextual similarity. For example, in WordNet, hospital 
and doctor have no common ancestor, and hence their similarity is 0, while doctor and lawyer 
are quite similar, because both designate professionals, humans, and living things. Note that, 
contextually, doctor should be more similar to hospital than to lawyer. Second, we found that the 
WordNet similarity values dominated the contextual similarity computed in the iterative process, 
preventing the transitive effects of contextual similarity from taking over. Third, the tree distance 
in itself does not always correspond to the intuitive notion of similarity, because different concepts 
appear at different level of abstraction, and have a different number of nested sub-concepts. For 
example, a certain distance between two nodes may result from: (1) the nodes being semantically 
close, but separated by a large distance, stemming from a high level of detail in the related synsets, 
or from (2) the nodes being semantically far from each other.^ 

3.4 Ignoring irrelevant examples 

The feedback sets we use in training the system may contain noise, in the form of irrelevant examples 
that are collected along with the relevant and useful ones. For instance, in one of the definitions of 
bank in WordNet, we find bar, which, in turn, has many other senses that are not related to bank. 
Although these unrelated senses contribute examples to the feedback set, our system is hardly 
affected by this noise, because we do not collect statistics on the feedback sets (i.e., our method 
is not based on mere cooccurrence frequencies, as most other corpus-based methods are). The 
relevant examples in the feedback set of the sense Sj will attract the examples of Sj; the irrelevant 
examples, will not attract the examples of Sj, but neither will they do damage, because they are 
not expected to attract examples of (j / i). 

3.5 Related work 

3.5.1 The knowledge acquisition bottleneck 

Brown et al. (1991) and Gale et al. (1992) used the translations of the ambiguous word in a 
bilingual corpus as sense tags. This does not obviate the need for manual work, as producing 
bilingual corpora requires manual translation workj^ 

Dagan and Itai (1991) used a bilingual lexicon and a monolingual corpus, to save the need for 
translating the corpus. The problem remains, however, that the word translations do not necessarily 
overlap with the desired sense distinctions. 

Schutze (1992) clustered the examples in the training set, and manually assigned each cluster 
a sense by observing 10-20 members of the cluster. Each sense was usually represented by several 
clusters. Although this approach significantly decreased the need for manual intervention, about a 
hundred examples had still to be tagged manually for each word. Moreover, the resulting clusters 
did not necessarily correspond to the desired sense distinctions. 

Yarowsky (1992) learned discriminators for each Roget's category, saving the need to separate 
the training set into senses. However, using such hand-crafted categories usually leads to a coverage 
problem for specific domains, or for domains other than the one for which the list of categories has 
been prepared. 

^Resnik (1995) recently suggested to overcome this particular difficulty by a diflerent measure that takes into 
account the informal iveness of the most specific common ancestor of the two words. 

^MRD's are, of course, also constructed manually, but, unlike bilingual corpora, these are existing resources, made 
for general use. 
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Using MRDs for WSD was su ggested in ([Lesk, 1986|); se veral researchers subsequently continued 
and improved this line of work ( [Krovetz and Croft, 1989 ; Guthrie et al., 199 1| ; Veronis and Ide, 



1990). Unlike the information in a corpus, the information in the MRD definitions is presorted into 



senses. However, as noted above, the MRD definitions alone do not contain enough information 
to allow reliable disambiguation. Recently, Yarowsky (1995) combined a MRD and a corpus in 
a bootstrapping process. In that work, the definition words were used as initial sense indicators, 
tagging automatically the target word examples containing them. These tagged examples were then 
used as seed examples in the bootstrapping process. In comparison, we suggest to combine further 
the corpus and the MRD by use all the corpus examples of the MRD definition words, instead of 
those words alone. This yields much more sense-presorted training information. 



3.5.2 The problem of sparse data 



Most previous works define word similarity based on cooccurrence information, and hence face a 
severe problem of sparse data. Many of the possible cooccurrences are not observed even in a very 
large corpus ( phurch and Mercer, 1993 ). Our algorithm addresses this problem in two ways. First, 
we replace the all-or-none indicator of cooccurrence by a graded measure of contextual similarity. 
Our measure of similarity is transitive, allowing two words to be considered similar even if they are 
neither observed in the same sentence, nor share neighbor words. Second, we extend the training 
set by adding examples of related words. The performance of our system compares favorably to 
that of systems trained on sets larger by a factor of 100 (the results described in section ^ were 
obtained following learning from several dozen examples, in comparison to thousands of examples 
in other automatic methods). 

Traditionally, the problem of sparse data is approached by estimating the probability of un- 
observed cooccurrences using the actual cooccurrences in the training set. This can be done by 
smoothing the observed frequencies ( Church and Mercer, 199^), or by class-ba s ed methods ([Brown 
et al, 19911 ; [Pereira and Tishby, 1992| ; [Pereira et al., 1993| ; [Hirschman, 1986| ; [Resnik, 1992| ; |Bril] 
et al., 199q ; Pagan et al., 1993| ). In comparison to these approaches, we use similarity information 
throughout training, and not merely for estimating cooccurrence statistics. This allows the system 
to learn successfully from very sparse data. 



3.6 Summary 

We have described an approach to WSD that combines a corpus and a MRD to generate an extensive 
data set for learning similarity-based disambiguation. Our system combines the advantages of 
corpus-based approaches (large number of examples) with those of the MRD-based approaches 
(data pre-sorted by senses), by using the MRD definitions to direct the extraction of training 
information (in the form of feedback sets) from the corpus. 

In our system, a word is represented by the set of sentences in which it appears. Accordingly, 
words are considered similar if they appear in similar sentences, and sentences are considered similar 
if they contain similar words. Applying this definition iteratively yields a transitive measure of 
similarity under which two sentences may be considered similar even if they do not share any 
word, and two words may be considered as similar even if they do not share neighbor words. Our 
experiments show that the resulting alternative to raw cooccurrence-based similarity leads to better 
performance on very sparse data. 
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A Appendix 

A.l Stopping conditions of the iterative algorithm 

Let fi be the increase in the similarity value in iteration i: 

fi{x,y) = simix, y) - simi_i(^, y) (9) 

where X, Y can be either words or sentences. For each item X, the algorithm stops updating its 
similarity values to other items (that is, updating its row in the similarity matrix) in the first 
iteration that satisfies maxyfi{X,y) < e, where e > is a preset threshold. 

According to this stopping condition, the algorithm terminates after at most j iterations (oth- 
erwise, in i iterations with each fi > e, we obtain sim{X, 3^) > e • ^ = 1, in contradiction to upper 
bound of 1 on the similarity values; see section |A.2D . 

We found that the best results are obtained within three iterations. After that, the disam- 
biguation results tend not to change significantly, although the similarity values may continue to 
increase. Intuitively, the transitive exploration of similarities is exhausted after three iterations. 

A. 2 Proofs 

In the following, X, Y can be either words or sentences. 
Theorem 1 Similarity is bounded: sininiX, y) < 1 



Proof 1 By induction on the number of iteration. At the first iteration, simo(Af, y) < 1, by 
initialization. Assume that the claim holds for n, and prove for n + \: 



simn+i{x,y) = weight{xj,x) max simn{xj,yk) 

< ^ weight{xj,x) ■ 1 (by the induction hypothesis) 



1 



Theorem 2 Similarity is reflexive: WX, sim(X, X) = 1 



Proof 2 By induction on the number of iteration. simo(^, X) = 1, by initialization. Assume that 
the claim holds for n, and prove for n+ 1: 

sim„+i(^, X) = ^ weight(Afj, r^") • max sim„(^j, Xj) 

x^^x ^^^^ 

> ^ weight (A'j, X)-simniXi, Xi) 

= ^ weight (Ai, X) ■ 1 (by the induction hypothesis) 
= 1 

Thus, sim„_|_i(<Y, X) > 1. By theorem 1, sim„4.i(^, X) < 1, so sim„+i(Af, X) = 1. 
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Theorem 3 Similarity siin„(Af, y) is a non- decreasing function of the number of iteration n. 

Proof 3 By induction on the number of iteration. Consider the case of n = 1; simi(^, y) > 
simo(A', y) (if simo(X , y) = 1, then X = y, and, simi(A', y) = 1 as well; else simo(A', 3^) = 
and simi(^, 3^) > = simo(Af, y) ). Now, assume that the claim holds for n, and prove for n + 1: 

siin„+i(A', y) - simn{X, y) = 

= ^ weight(A'j, A") ■ max sim„(Aj, 3^fe) — ^ weight(A'j, A") • max sim„_i(Aj, 
Xjex '^''^•^ XjGX •^''^^ 

> V weight(A'_,-, X) ■ ( max sim„(A'j, 3^^) - max sim„_i(A'j-, 3^^) 

xex yyk&y yk&y 

> 

The last inequality hold because, by the induction hypothesis, 

yXj,yk, siiRniXj, y^) > sim„_i(;fj, yk) 

max sim„(A:',-, 3^^) > max sim„_i(;tj, 3^^) 

Vk^y yk^y 

maxsim„(Aj, 3^11;) — max sim„_i(A'j, 3^^) > 

Thus, all the items under the sum are nonnegative, and so must be their weighted average. As a 
consequence, we may conclude that the iterative estimation of similarity converges. 

A. 3 Word weights 

In our algorithm, the weight of a word estimates its expected contribution to the disambiguation 
task, and the extent to which the word is indicative in sentence similarity. The weights do not 
change with iterations. They are used to reduce the number of features to a manageable size, and 
to exclude words that are expected to be given unreliable similarity values. The weight of a word is 
a product of several factors: frequency in the corpus, the bias inherent in the training set, distance 
from the target word, and part of speech label: 

1. Global frequency. Frequent words are less informative of the sense and of the sentence similar- 
ity (e.g., the appearance of this in two different sentences does not indicate similarity between 
them, and does not indicate the sense of any target word). The contribution of frequency is 
max{0, 1 — maxS^^freqCA') ' ^here max5;t'freq(-%') is a function of the five highest frequencies in 
the corpus. This factor excludes only the most frequent words from further consideration. As 
long as the frequencies are not very high, it does not label Wi whose frequency is twice that 
of as less informative. 

2. Log likelihood factor. Words that are indicative of the sense usually appear in the training set 
more than what would have been expected from their frequency in the general corpus. The 
log likelihood factor captures this tendency. It is computed as 

Pr(n^ (10) 
^ Pr(>Vi) ^ ^ 



17 



where Pr(>Vj) is estimated from the frequency of W in the entire corpus, and Pr(yVj | W) — 
from the frequency of Wj in the training set, given the examples of the current ambiguous 
word W (cf. ( pale et al., 1992 )).p] To avoid poor estimation for words with a low count in 



the training set, we multiply the log likelihood by min{l, } where count{W) is the 

number of occurrences of W in the training set. 

3. Part of speech. Each part of speech is assigned an initial weight (1.0 for nouns and 0.6 for 
verbs). 

4. Distance from the target word. Context words that are far from the target word are less 
indicative than nearby ones. The contribution of this factor is reciprocally related to the 
normalized distance. 



The total weight of a word is the product of the above factors, each normalized by the sum of 
factors of the words in the sentence: weight(>Vi, S) = f actor (w S) ' ^^^^^ factor(., .) is the 



weight before normalization. 



A. 4 Other uses of context similarity 

The similarity measure developed in the present paper can be used for tasks other than word sense 
disambiguation. Here, we illustrate a possible application to automatic construction of a thesaurus. 

Following the training phase for a word X, we have a word similarity matrix for the words in 
the contexts of X. Using this matrix, we construct for each sense Sj of X a set of related words, R: 

1. Initialize R to the set of words appearing in the MRD definition of s^; 

2. Extend R recursively: for each word in R added in the previous step, add its k nearest 
neighbors, using the similarity matrix. 

3. Stop when no new words (or too few new words) are added. 

Upon termination, output for each sense Sj the set of its contextually similar words R. 



^Because this estimate is unreliable for words with low frequencies in each sense set, Gale et al. (1992) suggested to 
interpolate between probabilities computed within the sub-corpus and probabilities computed over the entire corpus. 
In our case, the denominator is the frequency in the general corpus instead of the frequency in the sense examples, 
so it is more reliable. 
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